Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The goal of this project is to predict the manner in which each subject performed the exercise. This is the classe variable in the training set. This report describes how the model was built, explains the cross-validation method used, estimates the out-of-sample error, and justifies the modelling choices. The model built on the training set will then be used to predict 20 different test cases.
The training data is available here.
The testing data is available here.
The data for this project come from this source.
If you use this data for any purpose, please cite them!
This analysis will attempt to classify the data by using three types of robust classification algorithm:
Random Forest
Boosting
Support Vector Machine (both linear and radial kernels)
These models were selected for their high accuracy and manageable complexity. Other algorithms available through the caret package, such as mxnet, could offer higher accuracy; however, due to hardware and time limitations, they are out of the scope of this project.
These models will be built on a 70% partition of the training data set and benchmarked against a validation set, before performing predictions against the 20 unlabeled test examples.
These models will be cross-validated using the following train control function: trainControl(method = "cv", number = 5), that is, 5-fold cross-validation. The only other modified parameter is the number of trees grown by the random forest model, which has been limited to 80 in order to reduce computation time.
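The setup described above can be sketched roughly as follows. This is a minimal illustration rather than the code behind this report: the built-in iris data stands in for the accelerometer data, the seed and the object names (in_train, train_set, valid_set, ctrl, fit_rf) are hypothetical, and the caret and randomForest packages are assumed to be installed.

```r
library(caret)

set.seed(1234)  # hypothetical seed, only for reproducibility of this sketch

# Stand-in data: iris replaces the accelerometer data, Species replaces classe
data(iris)

# 70% of the rows for model building, the remainder held out for validation
in_train  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[in_train, ]
valid_set <- iris[-in_train, ]

# 5-fold cross-validation, as in the train control call quoted above
ctrl <- trainControl(method = "cv", number = 5)

# Random forest; ntree = 80 is passed through train() to randomForest()
fit_rf <- train(Species ~ ., data = train_set, method = "rf",
                trControl = ctrl, ntree = 80)

# Benchmark on the held-out validation set
pred <- predict(fit_rf, newdata = valid_set)
print(confusionMatrix(pred, valid_set$Species))
```

Evaluating on the held-out partition rather than on the training rows is what makes the reported accuracy usable as an out-of-sample estimate.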
$table
          Reference
Prediction    A    B    C    D    E
         A 1674    0    0    0    0
         B    7 1127    5    0    0
         C    0    5 1019    2    0
         D    0    0    7  954    3
         E    0    0    0    2 1080
$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
     0.9947324      0.9933361      0.9925313      0.9964182      0.2856415
AccuracyPValue  McnemarPValue
     0.0000000            NaN
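As a check on the figures above, the accuracy, Kappa, and implied out-of-sample error can be recomputed directly from the confusion matrix. The following is a small base R sketch; the variable names are illustrative, and the matrix is copied from the table above.

```r
# Confusion matrix copied from the table above (rows as printed, columns as printed)
cm <- matrix(c(1674,    0,    0,   0,    0,
                  7, 1127,    5,   0,    0,
                  0,    5, 1019,   2,    0,
                  0,    0,    7, 954,    3,
                  0,    0,    0,   2, 1080),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column margins
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled to [.., 1]
kappa <- (accuracy - expected) / (1 - expected)

# Estimated out-of-sample error on the validation set
oos_error <- 1 - accuracy

print(round(c(accuracy = accuracy, kappa = kappa, oos_error = oos_error), 7))
```

With 5854 of 5885 validation cases on the diagonal, the estimated out-of-sample error for this model is roughly 0.5%.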
$table
          Reference
Prediction    A    B    C    D    E
         A 1656   10    3    5    0
         B   38 1060   39    2    0
         C    0   35  979    9    3
         D    1    4   32  918    9
         E    7   11    9   15 1040
$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
  9.605777e-01   9.501083e-01   9.552874e-01   9.654049e-01   2.892099e-01
AccuracyPValue  McnemarPValue
  0.000000e+00   7.638490e-09
$byClass
         Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: A   0.9729730   0.9956969      0.9892473      0.9890762 0.9892473
Class: B   0.9464286   0.9834208      0.9306409      0.9873578 0.9306409
Class: C   0.9218456   0.9902550      0.9541910      0.9829183 0.9541910
Class: D   0.9673340   0.9906807      0.9522822      0.9937005 0.9522822
Class: E   0.9885932   0.9913097      0.9611830      0.9975016 0.9611830
            Recall        F1 Prevalence Detection Rate Detection Prevalence
Class: A 0.9729730 0.9810427  0.2892099      0.2813934            0.2844520
Class: B 0.9464286 0.9384683  0.1903144      0.1801189            0.1935429
Class: C 0.9218456 0.9377395  0.1804588      0.1663551            0.1743415
Class: D 0.9673340 0.9597491  0.1612574      0.1559898            0.1638063
Class: E 0.9885932 0.9746954  0.1787596      0.1767205            0.1838573
         Balanced Accuracy
Class: A         0.9843349
Class: B         0.9649247
Class: C         0.9560503
Class: D         0.9790074
Class: E         0.9899515
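The per-class statistics above follow from treating each class in turn as "positive" against all the others. The base R sketch below recomputes a few of them from the confusion matrix; it takes the table orientation as labeled (rows are predictions, columns are the reference), and the variable names are illustrative.

```r
# Confusion matrix copied from the table above
cm <- matrix(c(1656,   10,   3,   5,    0,
                 38, 1060,  39,   2,    0,
                  0,   35, 979,   9,    3,
                  1,    4,  32, 918,    9,
                  7,   11,   9,  15, 1040),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

# One-vs-rest counts for every class at once
tp <- diag(cm)              # predicted as the class and truly the class
fp <- rowSums(cm) - tp      # predicted as the class, truly another class
fn <- colSums(cm) - tp      # truly the class, predicted as another class
tn <- sum(cm) - tp - fp - fn

sensitivity <- tp / (tp + fn)   # identical to recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # identical to positive predictive value here
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced    <- (sensitivity + specificity) / 2

print(round(cbind(Sensitivity = sensitivity, Specificity = specificity,
                  Precision = precision, F1 = f1,
                  `Balanced Accuracy` = balanced), 7))
```

Classes B and C have the lowest sensitivity in this table, which matches the off-diagonal counts: most of the errors are confusions between B and C and their neighbouring classes.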
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 7225
Objective Function Value : -1450.051 -1275.306 -1046.239 -628.1268 -1326.171 -881.7941 -1774.954 -1207.525 -1037.249 -1209.01
Training error : 0.210526
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.0138342908497241
Number of Support Vectors : 7042
Objective Function Value : -1128.491 -836.3012 -734.9416 -438.009 -1048.849 -589.373 -761.5491 -1012.77 -718.7094 -628.9886
Training error : 0.068501
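The two summaries above are kernlab ksvm objects, which is what caret stores as the final model for its "svmLinear" and "svmRadial" methods. A minimal sketch of how such fits could be produced is given below. It is not the code behind this report: iris stands in for the accelerometer data, the seed and object names are hypothetical, and the caret and kernlab packages are assumed to be installed.

```r
library(caret)

set.seed(1234)  # hypothetical seed, only for reproducibility of this sketch

# Stand-in data: iris replaces the accelerometer data, Species replaces classe
data(iris)
in_train  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[in_train, ]

ctrl <- trainControl(method = "cv", number = 5)

# Linear kernel SVM
fit_linear <- train(Species ~ ., data = train_set,
                    method = "svmLinear", trControl = ctrl)

# Radial basis kernel SVM; caret tunes the cost C over a small default grid
fit_radial <- train(Species ~ ., data = train_set,
                    method = "svmRadial", trControl = ctrl)

# Printing the final models gives summaries in the same layout as above
print(fit_linear$finalModel)
print(fit_radial$finalModel)
```

The training error printed in these summaries is measured on the data the model was fitted to, so it is best read alongside the validation confusion matrices rather than as an out-of-sample estimate.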
$table
          Reference
Prediction    A    B    C    D    E
         A 1550   27   34   55    8
         B  143  824   68   22   82
         C   92   94  804   19   17
         D   68   30  112  707   47
         E   73  131   70   52  756
$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
  7.886151e-01   7.310874e-01   7.779569e-01   7.989865e-01   3.272727e-01
AccuracyPValue  McnemarPValue
  0.000000e+00   3.543329e-53
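The AccuracyLower, AccuracyUpper, AccuracyNull, and AccuracyPValue entries can be approximated from the table alone. The base R sketch below assumes that the interval is an exact binomial (Clopper-Pearson) interval and that the p-value comes from a one-sided binomial test against the no-information rate; that is an assumption about how these figures were derived, and the variable names are illustrative.

```r
# Confusion matrix copied from the table above
cm <- matrix(c(1550,  27,  34,  55,   8,
                143, 824,  68,  22,  82,
                 92,  94, 804,  19,  17,
                 68,  30, 112, 707,  47,
                 73, 131,  70,  52, 756),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

correct <- sum(diag(cm))   # correctly classified validation cases
total   <- sum(cm)         # all validation cases

# 95% exact binomial confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int

# No-information rate: share of the largest reference class
nir <- max(colSums(cm)) / total

# One-sided test of whether the accuracy exceeds the no-information rate
p_value <- binom.test(correct, total, p = nir,
                      alternative = "greater")$p.value

print(c(accuracy = correct / total, lower = ci[1], upper = ci[2],
        no_information_rate = nir, p_value = p_value))
```

An accuracy interval that sits far above the no-information rate indicates that the model does much better than always predicting the most common class, even when its accuracy is the lowest of the models compared.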
$table
          Reference
Prediction    A    B    C    D    E
         A 1660    5    7    2    0
         B   87  997   52    1    2
         C    4   41  956   24    1
         D    6    3  105  849    1
         E    6   13   57   24  982
$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
  9.250637e-01   9.050519e-01   9.180390e-01   9.316635e-01   2.995752e-01
AccuracyPValue  McnemarPValue
  0.000000e+00   2.353217e-41
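The four validation results can be put side by side to compare estimated out-of-sample error. The base R sketch below copies the accuracy and Kappa values from the outputs above. The model labels are an assumption: the first two results are taken to be random forest and boosting, following the order in which the models were listed, and the last two are matched to the linear and radial SVM summaries by their training errors.

```r
# Accuracy and Kappa copied from the $overall outputs, in order of appearance.
# The model labels are assumed from the order of the outputs, not stated there.
results <- data.frame(
  model    = c("Random forest", "Boosting", "SVM (linear)", "SVM (radial)"),
  accuracy = c(0.9947324, 0.9605777, 0.7886151, 0.9250637),
  kappa    = c(0.9933361, 0.9501083, 0.7310874, 0.9050519)
)

# Estimated out-of-sample error on the validation set
results$oos_error <- 1 - results$accuracy

# Best model first
results <- results[order(results$oos_error), ]
print(results, row.names = FALSE)
```

Under that labeling, the random forest has the lowest estimated out-of-sample error (about 0.5%), followed by boosting, the radial SVM, and the linear SVM.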